Dynamic Evaluation
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
The current paradigm of evaluating Large Language Models (LLMs) through static benchmarks comes with significant limitations, such as vulnerability to data contamination and a lack of adaptability to the evolving capabilities of LLMs. Therefore, evaluation methods that can adapt and generate evaluation data with controlled complexity are urgently needed. In this work, we introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
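The extract-then-perturb loop described above can be illustrated with a toy computation graph. This is a hedged sketch, not DARG's implementation: the dict-based graph encoding, the `evaluate` and `add_depth` helpers, and the apples example are all invented here to show the idea of raising a data point's reasoning complexity in a controlled way.

```python
# Illustrative sketch of DARG-style graph perturbation: a math word problem's
# reasoning is stored as a small computation graph, and complexity is raised
# by appending one more arithmetic node before the answer is recomputed.
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(graph, node):
    """Recursively evaluate one node of the reasoning graph."""
    spec = graph[node]
    if "value" in spec:                      # leaf: a quantity from the problem
        return spec["value"]
    left = evaluate(graph, spec["args"][0])
    right = evaluate(graph, spec["args"][1])
    return OPS[spec["op"]](left, right)

def add_depth(graph, root, op, value, name="extra"):
    """Perturb the graph: wrap the old root node in one more operation."""
    graph = dict(graph)                      # leave the original graph intact
    graph[name + "_leaf"] = {"value": value}
    graph[name] = {"op": op, "args": [root, name + "_leaf"]}
    return graph, name

# Original problem: "3 boxes of 4 apples" -> 3 * 4 = 12
graph = {
    "boxes":  {"value": 3},
    "apples": {"value": 4},
    "total":  {"op": "*", "args": ["boxes", "apples"]},
}
print(evaluate(graph, "total"))              # 12

# Perturbed variant with one extra reasoning step: (3 * 4) + 5 = 17
harder, root = add_depth(graph, "total", "+", 5)
print(evaluate(harder, root))                # 17
```

Because the perturbed graph still evaluates to a ground-truth answer, a new natural-language problem can be generated from it with a known label, which is what makes the extension controllable.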
A How General Are These Findings? A.1 The effect of outdated models persists beyond the 2018/2019 test period. Following Section 2.2, we derive different … A.2 The effect of outdated models persists beyond the two-year gap. For this experiment, we keep the same 2018–2019 test set introduced in Section 2.2, and train models with … We test whether the temporal degradation trend is a generalizable pattern that holds across languages. In practice, our implementation of dynamic evaluation differs from Eq. 2 in two ways: (i) we perform … The x-axis presents the years in reverse chronological order.
Mind the Gap: Assessing Temporal Generalization in Neural Language Models
In the case of GPT-3 (Brown et al., 2020), such tasks include LAMBADA (Paperno et al., 2016), TriviaQA … First, they do not assess a language model's ability to generalize well to future data from beyond their training period … (Augenstein et al., 2019), forecasting stock prices from the latest news articles (Ding et al., 2015), and answering knowledge-intensive questions like "How many people have been infected by COVID-19?" Second, the temporal overlap between the training and evaluation data increases the risk of "test data … Nevertheless, language modelling data are not i.i.d.: Brown et al. (2020) used … This can potentially induce a correlation between the training and evaluation sets that LMs can exploit.
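The temporal-degradation setup this paper studies can be sketched in miniature: hold the training data fixed at a cutoff year and score held-out text from each later year, so degradation over time becomes visible. Everything below is a stand-in assumption for illustration, from the tiny corpus to the add-one-smoothed unigram model; the paper itself evaluates Transformer LMs on real news and scientific text.

```python
# Sketch of time-stratified evaluation: "train" a smoothed unigram model on
# text up to a cutoff year, then report perplexity separately per future year.
from collections import Counter
import math

def unigram_model(docs):
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    # add-one smoothing so words unseen before the cutoff get nonzero mass
    return lambda w: (counts[w] + 1) / (total + vocab)

def perplexity(model, docs):
    words = [w for d in docs for w in d.split()]
    logp = sum(math.log(model(w)) for w in words)
    return math.exp(-logp / len(words))

corpus = {  # hypothetical year-bucketed corpus
    2017: ["the economy grew", "the markets rose"],
    2018: ["the economy slowed", "the markets fell"],
    2019: ["the pandemic changed the markets"],
}
cutoff = 2017
model = unigram_model(corpus[cutoff])
for year in sorted(y for y in corpus if y > cutoff):
    print(year, round(perplexity(model, corpus[year]), 2))
```

In this toy run the 2019 bucket, which introduces vocabulary absent from the training year ("pandemic"), scores worse than the 2018 bucket, mirroring the degradation pattern the paper measures at scale.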
Dynamic Evaluation for Oversensitivity in LLMs
Pu, Sophia Xiao, Cheng, Sitao, Wang, Xin Eric, Wang, William Yang
Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade over time as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model's unique behavior. Building on this approach, we construct OVERBENCH, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 25 models. OVERBENCH provides a dynamic and evolving perspective on oversensitivity, allowing continuous monitoring of defensive triggers as models advance and highlighting vulnerabilities that static datasets overlook.
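The core quantity such a benchmark tracks is the share of benign prompts a model refuses. A minimal sketch is below; the keyword heuristic and the example responses are invented for illustration only, and a real detector in a framework like this would be considerably more robust (and model-specific) than substring matching.

```python
# Sketch of an oversensitivity score: the fraction of responses to benign
# prompts that look like refusals, using an assumed keyword heuristic.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def oversensitivity_rate(benign_responses):
    refused = sum(is_refusal(r) for r in benign_responses)
    return refused / len(benign_responses)

responses = [  # hypothetical model outputs to benign prompts
    "Sure, here is a short poem about autumn.",
    "I'm sorry, but I can't help with that request.",
    "Boiling water takes about ten minutes at altitude.",
    "I cannot assist with this topic.",
]
print(oversensitivity_rate(responses))   # 0.5
```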
Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
Hathidara, Ashutosh, Yu, Julien, Schreiber, Sebastian
Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across the 3B–70B parameter range, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
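The static half of such an evaluation boils down to checking two failure modes the abstract names: picking the wrong near-duplicate tool, and leaving required arguments underspecified. A minimal sketch follows; the tool specs, call format, and `score_call` helper are hypothetical stand-ins, not the DiaBENCH harness.

```python
# Sketch of scoring one tool call: success requires both naming the expected
# tool (not a near-duplicate) and supplying every required argument.
def score_call(call, expected_tool, tool_specs):
    """Return True only if the right tool is picked and fully specified."""
    if call["name"] != expected_tool:
        return False                     # confused by a near-duplicate tool
    required = tool_specs[call["name"]]["required"]
    return all(arg in call["args"] for arg in required)

tool_specs = {  # invented near-duplicate enterprise tools
    "create_invoice": {"required": ["customer_id", "amount"]},
    "create_invoice_draft": {"required": ["customer_id"]},
}
good = {"name": "create_invoice", "args": {"customer_id": "C7", "amount": 120}}
underspecified = {"name": "create_invoice", "args": {"customer_id": "C7"}}
wrong_tool = {"name": "create_invoice_draft", "args": {"customer_id": "C7"}}

for call in (good, underspecified, wrong_tool):
    print(score_call(call, "create_invoice", tool_specs))  # True, False, False
```

The dynamic suite described in the abstract goes further by replaying the model inside a live agentic loop, where a wrong first call can still be recovered in later turns, which is why it reports end-to-end goal completion rather than per-call accuracy alone.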
Dynamic Evaluation of Large Language Models by Meta Probing Agents
Zhu, Kaijie, Wang, Jindong, Zhao, Qinlin, Xu, Ruochen, Xie, Xing
Evaluation of large language models (LLMs) has raised great concerns in the community due to the issue of data contamination. Existing work designed evaluation protocols using well-defined algorithms for specific tasks, which cannot be easily extended to diverse scenarios. Moreover, current evaluation benchmarks can only provide overall benchmark results and cannot support a fine-grained, multifaceted analysis of LLMs' abilities. In this paper, we propose meta probing agents (MPA), a general dynamic evaluation protocol inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal 2, which naturally extends the previous DyVal (Zhu et al., 2023). MPA designs probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory on three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These basic abilities are also dynamically configurable, allowing multifaceted analysis. We conducted extensive evaluations using MPA and found that most LLMs achieve poorer performance, indicating room for improvement. Our multifaceted analysis demonstrated a strong correlation between the basic abilities and an implicit Matthew effect with respect to model size, i.e., larger models exhibit stronger correlations among the abilities. MPA can also be used as a data augmentation approach to enhance LLMs. Code is available at: https://github.com/microsoft/promptbench.
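One simple probing transform in the spirit of the description above is rewriting a multiple-choice item so the underlying problem is unchanged while surface cues shift. The sketch below shuffles answer options with a fixed seed; it is an illustrative assumption about what such a probe could look like, not the MPA implementation.

```python
# Sketch of a probing transform: permute a multiple-choice item's options so
# the correct answer's position changes while the question stays fixed.
import random

def permute_options(question, options, answer_index, seed=0):
    rng = random.Random(seed)            # fixed seed: reproducible probe
    order = list(range(len(options)))
    rng.shuffle(order)
    new_options = [options[i] for i in order]
    new_answer = new_options.index(options[answer_index])
    return question, new_options, new_answer

q, opts, ans = permute_options(
    "What is 2 + 3?", ["4", "5", "6", "7"], answer_index=1)
print(opts[ans])                         # "5"
```

A model that answers the original item correctly but fails the permuted one is likely exploiting positional or memorized cues rather than solving the problem, which is exactly the kind of fine-grained signal a probing-and-judging protocol is after.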
Collapse of Self-trained Language Models
In various fields of knowledge creation, including science, new ideas often build on pre-existing information. In this work, we explore this concept within the context of language models. Specifically, we explore the potential of self-training models on their own outputs, akin to how humans learn and build on their previous thoughts and actions. While this approach is intuitively appealing, our research reveals its practical limitations. We find that extended self-training of the GPT-2 model leads to a significant degradation in performance, resulting in repetitive and collapsed token output.
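The repetitive collapse reported above can be quantified with a standard diversity statistic such as distinct-n, the share of unique n-grams in a generated text. The function and the toy strings below are a hedged sketch of one way to measure this, not the paper's own metric.

```python
# Sketch of distinct-n: values near 1 indicate diverse text, values near 0
# indicate the looping, repetitive output characteristic of model collapse.
def distinct_n(tokens, n=2):
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

healthy = "the cat sat on the mat while the dog slept outside".split()
collapsed = ("the the " * 6).split()     # degenerate, looping output

print(round(distinct_n(healthy), 2))     # 1.0
print(round(distinct_n(collapsed), 2))   # 0.09
```

Tracking a statistic like this across successive self-training generations makes the degradation measurable: each round of training on the model's own outputs pushes the score further toward zero.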